Methods for Data Science Coursework 3

Juliette Maiko Limozin

CID: 01343907

Deadline: 10/01/2020, 5 pm

Task 1: Unsupervised learning: text documents with an associated citation graph (45 marks)

1.1 Clustering of the feature matrix (15 marks):

In [1]:
import sklearn
import networkx as nx
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
import warnings
import pickle

fmat = pd.read_csv("feature_matrix.csv", header = None)

amat = pd.read_csv("adjacency_matrix.csv", header = None)

We first proceed by running a k-means algorithm on the feature matrix F, for all values of k in [2,30]. For that I use the cluster.KMeans function in sklearn:

In [2]:
from sklearn.cluster import KMeans
f_kmeans = [0]*29 #define empty list to store the labels created by the algorithm in
ch = np.zeros(29) #define empty array to store CH scores
for i in range(2,31): #loop for all values of k
    kmeans = KMeans(n_clusters = i).fit(fmat) #fit algorithm
    f_kmeans[i-2] = kmeans.labels_ #predicted label by algorithm
    ch[i-2] = sklearn.metrics.calinski_harabasz_score(fmat, f_kmeans[i-2]) #calculate CH score for each k

Let us look at the CH score of each clustering:

In [3]:
plt.plot(range(2,31), ch) #plot Ch as a function of k

plt.xlabel("k")
plt.ylabel("CH score")

plt.title("CH score as a function of k")

plt.show
Out[3]:
<function matplotlib.pyplot.show(*args, **kw)>
In [4]:
df = pd.DataFrame({'CH': ch})
k = df.index[df.CH < 7][0]
print('The first value k with a CH score of less than 7 is k = ', k)
The first value k with a CH score of less than 7 is k =  20

So let us choose this value as our optimal clustering.

Let us look at the distribution of cluster sizes for this optimal clustering.

In [5]:
plt.hist(f_kmeans[k-2])
plt.title('Distribution of cluster sizes')
plt.show()

The 9th cluster seems to have be the largest group, followed by the second cluster.

The CH score evaluates the cluster validity based on the average between- and withincluster sum of squares.

Some other interesting statistics to look at for cluster analysis are the Davies-Bouldin and Silhouette indexes. A low Davies-Bouldin score and a high Silhouette score indicate towards a good clustering.

In [6]:
print('Davies-Bouldin index: ',sklearn.metrics.davies_bouldin_score(fmat, f_kmeans[k-2]))
print('Silhouette index: ',sklearn.metrics.silhouette_score(fmat, f_kmeans[k-2], metric='euclidean'))
Davies-Bouldin index:  5.626330634776715
Silhouette index:  -0.004756222432451057

It is interesting to note the poor Davies-Bouldin and SIlhouette index we get. This indicates that the ck chosen according to the CH criteria might afterall not be the optimal clustering.

Given the randomness of the KMeans algorithm, my results vary at each rerunning of the cell so they are not very robust; however I define my optimal value k computationally instead of manually defining (say writing k = 20), which makes sure that across the code the value of k is adjusted according to the kmeans algorithm's result.

1.2 Analysis of the citation graph (10 marks)

First we will display the citation graph described by the adjacency matrix, A.

In [7]:
G = nx.from_pandas_adjacency(amat) #define graph
plt.figure(3, figsize=(30,30)) #plot the graph
pos = nx.spring_layout(G) #define position layout of nodes
nx.draw(G, pos, node_size = 45) #draw graph
plt.show()
/opt/anaconda3/lib/python3.7/site-packages/networkx/drawing/nx_pylab.py:579: MatplotlibDeprecationWarning: 
The iterable function was deprecated in Matplotlib 3.1 and will be removed in 3.3. Use np.iterable instead.
  if not cb.iterable(width):

Let us then plot the degree distribution of the graph as a histogram:

In [8]:
import collections

degree_sequence = sorted([d for n, d in G.degree()])  # degree sequence
degreeCount = collections.Counter(degree_sequence) #count nodes per degree
deg, cnt = zip(*degreeCount.items())

fig, ax = plt.subplots(figsize = (10,5)) #create plot
plt.bar(deg, cnt, width=0.5, color='b')

plt.title("Degree Histogram")
plt.ylabel("Count")
plt.xlabel("Degree")

plt.show()

As you can see most nodes have a low degree and the histogram suggest the degree distribution ressembles an exponential or normal distribution.

Now let us calculate centrality measures for all nodes. We use three different measure: degree, betweenness and pagerank.

In [9]:
dc = np.array(list(nx.degree_centrality(G).items()))[:,1] #define vector od degree centraloty of each node
bc = np.array(list(nx.betweenness_centrality(G).items()))[:,1] #betweenness centrality
pr = np.array(list(nx.pagerank(G).items()))[:,1] #pagerank
nodes = np.array(G.nodes()) #vector of all nodes

fig, (ax1, ax2, ax3) = plt.subplots(3, 1, sharex = 'col', figsize = (10,10)) #plot centrality measures by node

ax1.plot(nodes, dc)
ax1.set_title('Degree centrality')

ax2.plot(nodes, bc)
ax2.set_title('Betweenness centrality')

ax3.plot(nodes, pr)
ax3.set_xlabel('Nodes')
ax3.set_title('Pagerank')

plt.show()

It seems from these graphs that the most highly central node according to degree, betweenness centrality and Pagerank is somewhere around node number 1250. There is actually a function that can verify that for us called Voterank:

In [10]:
from networkx.algorithms.centrality import voterank
voterank(G)[0]
Out[10]:
1245

According to Voterank, node 1245 is the most central node. Let us check that according to the three centrality measures:

In [11]:
ranking = pd.DataFrame({'Node': nodes,
                       'Degree': dc,
                       'Betweenness': bc,
                       'Pagerank':pr})

degree_ranks = pd.DataFrame({'Degree': ranking.sort_values('Degree', ascending = False).Node}).reset_index(drop = True)
betweenness_ranks = pd.DataFrame({'Betweenness': ranking.sort_values('Betweenness', ascending = False).Node}).reset_index(drop = True)
pagerank_ranks = pd.DataFrame({'Pagerank': ranking.sort_values('Pagerank', ascending = False).Node}).reset_index(drop = True)

ranks = pd.concat([degree_ranks, betweenness_ranks, pagerank_ranks], axis = 1)

print('Top five most central nodes according to degree, betweenness and pagerank centrality measures')
print(ranks.head())
Top five most central nodes according to degree, betweenness and pagerank centrality measures
   Degree  Betweenness  Pagerank
0    1245         1245      1245
1     271         1846      1563
2    1563         1894      1846
3    1846         1563       271
4    1672          271      1672

So node 1245 is indeed the most central node according to the three measures.

However if we look at the following most central nodes, the rankings are not the same across the measures.

In [12]:
lab = ['Degree', 'Betweenness', 'Pagerank'] #create vector of axis labels
fig, axs = plt.subplots(3,3, figsize = (15,15)) #create correlation plot
for i in range(3):
    for j in range(3):
        axs[i,j].scatter(ranks.iloc[:,i], ranks.iloc[:,j])
        axs[i,j].set_xlabel(lab[i])
        axs[i,j].set_ylabel(lab[j])
plt.show()

As you can see, the node rankings according to different centrality measures are all uncorrelated to eachother.

This is due to how each measure is calculated.

For degree centrality, the function simply looks at the fraction of edges each node has, where as Pagerank's ranking is based on not only the fraction of edges, but the connectivity of the nodes one node is connected to. Simply put, Pagerank looks at the centrality of one node and the nodes it's connected to whereas degree only looks at the node itself.

Betweenness centrality on the other hand looks at the length of the edges a node is connected to, rather than the number of connection, hence the rankings differing from the other measures.

1.3 Community detection on the citation graph (10 marks)

In this task we use the Clauset-Newman-Moore greedy modularity maximisation algorithm in Networkx to compute the optimal number of communities k* and the corresponding partition of the citation graph.

In [13]:
from networkx.algorithms.community import greedy_modularity_communities
c = list(greedy_modularity_communities(G)) #partition of citation graph
In [14]:
len(c)
Out[14]:
29

The optimal number of communities k* according to the Greedy Modularity algorithm is 29.

I proceed to convert this partitioning into a list of lists to make it easier to hadnle later on.

In [15]:
communities = [0]*29
for i in range(0,29):
    communities[i] = list(c[i])
In [16]:
len(communities) #sense checking
Out[16]:
29

Now let us draw the graph from 1.2 but this time with nodes coloured according to their group:

In [17]:
#create vector of community labels for each node 
label_13 = np.zeros(2485)
for node in nodes:
    for j in range(29):
        if node in communities[j]:
            label_13[node] = j
            
plt.figure(3, figsize = (30,30))
nx.draw(G, pos, node_size = 45) #plot graph from 1.2
for i in range(0,29):
    nx.draw_networkx_nodes(G, pos, node_size = 45, node_color = label_13) 
    #assign node colours by community
plt.show()
In [18]:
top30 = pd.DataFrame(ranks[['Degree', 'Pagerank']].head(30)) 
#fetch top 30 central nodes according to degree and pagerank from previous table

counter_degree = [0]*29 #create vectors of counters of nodes in each community
counter_pagerank = [0]*29
for i in range(29):
    for j in range(30):
        if top30.iloc[j,0] in communities[i]:
            counter_degree[i] = counter_degree[i] + 1 #count nodes
        if top30.iloc[j,1] in communities[i]:
            counter_pagerank[i] = counter_pagerank[i] + 1

#plot dsitribution

plt.plot(range(29), counter_degree, color = 'r', label = 'Degree')
plt.plot(range(29), counter_pagerank, color = 'b', label = 'Pagerank')
plt.legend()
plt.xlabel('Community')
plt.ylabel('Number of nodes')
plt.title('Distribution of top 30 central nodes accross 29 communities')
plt.show()

As we can observe from the graph, the top 30 most central nodes according to Degree and Pagerank are distributed very similarly accorss the 29 communities.

1.4 Compare feature and graph clusterings (10 marks)

First, we use Adjusted Mutual Information (AMI) and Adjusted Rand Index (ARI) to score how similar the optimal clusterings obtained in 1.1 and 1.3 are to each other.

Let us calculate the AMI score:

In [19]:
from sklearn.metrics.cluster import adjusted_mutual_info_score
print('AMI score: ',adjusted_mutual_info_score(f_kmeans[k-2], label_13))
from sklearn.metrics.cluster import adjusted_rand_score
print('ARI score: ',adjusted_rand_score(f_kmeans[k-2], label_13))
AMI score:  0.17394562307487357
ARI score:  0.05462603868887756
/opt/anaconda3/lib/python3.7/site-packages/sklearn/metrics/cluster/supervised.py:746: FutureWarning: The behavior of AMI will change in version 0.22. To match the behavior of 'v_measure_score', AMI will use average_method='arithmetic' by default.
  FutureWarning)

These two low scores tell us that the classes members are almost completely split across different clusters so the assignment is almost totally incomplete, hence giving us an almost null score.

Here is the graph of the feature clusters obtained in 1.1:

In [20]:
plt.figure(3, figsize = (30,30))
nx.draw(G, pos, node_size = 45) #plot graph from 1.2
for i in range(0,29):
    nx.draw_networkx_nodes(G, pos, node_size = 45, node_color = f_kmeans[k-2]) 
    #assign node colours by community
plt.show()
/opt/anaconda3/lib/python3.7/site-packages/networkx/drawing/nx_pylab.py:579: MatplotlibDeprecationWarning: 
The iterable function was deprecated in Matplotlib 3.1 and will be removed in 3.3. Use np.iterable instead.
  if not cb.iterable(width):

Judging on the way each colour of node is spread accorss each graph, the optimal clusterings from 1.1 and 1.3 are visually different.

Let us look at other cluster similarity metrics to compare our two optimal clusterings:

In [21]:
print('Homogeneity score: ',sklearn.metrics.homogeneity_score(f_kmeans[k-2], label_13))
print('Completeness score: ',sklearn.metrics.completeness_score(f_kmeans[k-2], label_13))
print('V-measure score: ',sklearn.metrics.v_measure_score(f_kmeans[k-2], label_13, beta=1.0))
Homoheneity score:  0.21981080820513058
Completeness score:  0.20386305981717548
V-measure score:  0.2115367848895743

The low similarity metrics score further indicate that the two clusters include nodes from different classes.

Task 2: Classification of a set of images (45 marks)

In [22]:
#fetching data
from sklearn.datasets import fetch_openml
mnist = fetch_openml('Fashion-MNIST', cache=False)
X = mnist.data.astype('float32')
y = mnist.target.astype('int64')

2.1 Unsupervised clustering of the image dataset (20 marks)

In this task we use the k-means algorithm again, to cluster the Fashion-MNIST dataset (just the 'training' part) into k classes, for all values of k between k=2 and k=30.

To later evaluate whether the existence of 10 classes in the data, I also calculate metrics discussed in task 1.1 such as CH score, Davies-Bouldin score and Silhouette score for each clustering.

In [24]:
from sklearn.model_selection import train_test_split
#Split the data set into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)
In [25]:
train_kmeans = [0]*29
train_ch = np.zeros(29)
train_db = np.zeros(29)
train_silhouette = np.zeros(29)

for i in range(2,31): #loop for all values of k
    kmeans = KMeans(n_clusters = i).fit(X_train) #fit algorithm
    train_kmeans[i-2] = kmeans.labels_
    if i == 10:
        centers = np.array(kmeans.cluster_centers_)
    
    train_ch[i-2] = sklearn.metrics.calinski_harabasz_score(X_train, train_kmeans[i-2])
    train_db[i-2] = sklearn.metrics.davies_bouldin_score(X_train, train_kmeans[i-2])
    train_silhouette[i-2] = sklearn.metrics.silhouette_score(X_train, train_kmeans[i-2], metric='euclidean')
    print('Iterations done: ', i)
Iterations done:  2
Iterations done:  3
Iterations done:  4
Iterations done:  5
Iterations done:  6
Iterations done:  7
Iterations done:  8
Iterations done:  9
Iterations done:  10
Iterations done:  11
Iterations done:  12
Iterations done:  13
Iterations done:  14
Iterations done:  15
Iterations done:  16
Iterations done:  17
Iterations done:  18
Iterations done:  19
Iterations done:  20
Iterations done:  21
Iterations done:  22
Iterations done:  23
Iterations done:  24
Iterations done:  25
Iterations done:  26
Iterations done:  27
Iterations done:  28
Iterations done:  29
Iterations done:  30
In [245]:
train_ch = np.load('train_ch.npy') 
train_db = np.load('train_db.npy')
train_silhouette = np.load('train_silhouette.npy')
with open('train_kmeans', 'rb') as f:
     train_kmeans = pickle.load(f)

To tackle the randomness of the k-means algorithm I set a seed to the code, and I will also make sure to not re run the algorithm to define the optimal clustering, and instead directly use the results from the code above.

In [26]:
x = range(2,31)

fig, (ax1, ax2, ax3) = plt.subplots(1, 3 ,figsize = (15, 5))

ax1.plot(x, train_ch)
ax1.set_title('CH score by value of k')
ax1.set_xlabel('K')
ax1.set_ylabel('CH score')

ax2.plot(x, train_db, color = 'r')
ax2.set_title('Davies-Bouldin score by value of k')
ax2.set_xlabel('K')
ax2.set_ylabel('Davies-Bouldin score')

ax3.plot(x, train_silhouette, color = 'g')
ax3.set_title('Silhouette score by value of k')
ax3.set_xlabel('K')
ax3.set_ylabel('Silhouette score')

plt.show()

According to these three graphs, the evidence pointing towards there being 10 clusters is actually quite low. The dip of the Davies-Bouldin score at k = 10 is the only indication that is in favour of of this value of k.

Now taking k = 10 as the clustering, let us loo at the centroids.

In [37]:
plt.scatter(np.array(centers)[:, 0], np.array(centers)[:, 1], c='black', s=50, alpha=0.5)
plt.title('Visualisation of the centroid of each of the 10 clusters')
plt.show()
In [34]:
 
Out[34]:
(52500, 784)

We now use the k-means clustering with k=10 as a kNN classifier for the test set.

In [39]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)

X_train_knn = scaler.transform(X_train)
X_test_knn = scaler.transform(X_test)

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

neigh = KNeighborsClassifier().fit(X_train_knn, train_kmeans[8]) #8 = 10 - 2 adjusted index 
                                   #fit with the training and k = 10 data sets
y_pred = neigh.predict(X_test_knn) #predict test set

print('Accuracy on the test data: {0:.3f}%'.format( 100* accuracy_score(y_test, y_pred))) #accuracy score
Accuracy on the test data: 17.966%

2.2 Supervised classification of the training set (25 marks)

2.2.1 MLP neural network supervised classification

In this task we create a Multilayer Perceptor (MLP) network to classify the images into 10 classes.

Most of the code used here is from the previous coursework.

In [40]:
# Import relevant package and functions
import torch
import torch.nn as nn
import torch.utils.data
import torch.nn.functional as F
import time

# Define network set-up
input_size = 784 
hidden_size = 100
num_classes = 10
num_epochs = 30
batch_size = 128
learning_rate = 0.005

#convert dataset
X_train_nn = torch.from_numpy(X_train).float()
y_train_nn = torch.from_numpy(y_train)
X_test_nn = torch.from_numpy(X_test).float()
y_test_nn = torch.from_numpy(y_test)

Let us define the MLP network:

In [41]:
class MLP(nn.Module):
        def __init__(self):
            super(MLP, self).__init__()
            self.fc1 = nn.Linear(input_size, hidden_size) #this creates 3 hidden layers 
            self.fc2 = nn.Linear(hidden_size, hidden_size)
            self.fc3 = nn.Linear(hidden_size, hidden_size)
            self.fc4 = nn.Linear(hidden_size, num_classes)
            
        def forward(self, x):
            out = F.relu(self.fc1(x)) #ReLu as our activation function
            out = F.relu(self.fc2(out))
            out = F.relu(self.fc3(out))
            out = self.fc4(out)
            return out

I then proceed to create a function that trains the model on the train set X and tests it on the test set y.

In [42]:
def NeuralNet_function(NeuralNet, X_train_nn, y_train_nn, X_test_nn, y_test_nn, 
                       input_size, hidden_size, num_classes,num_epochs, batch_size, learning_rate, details):
    

    # Define a tensor data set and data loader as this is a requirement for PyTorch:
    # The tensor data set converts the data into a tensor and the data loader slices
    # the data into batches, which are feeded into the neural network one after another.
    train_nn = torch.utils.data.TensorDataset(X_train_nn, y_train_nn)
    train_loader = torch.utils.data.DataLoader(train_nn, batch_size=batch_size)
    
    test_nn = torch.utils.data.TensorDataset(X_test_nn, y_test_nn)
    test_loader = torch.utils.data.DataLoader(test_nn, batch_size=batch_size)
    
    # Define a fully connected neural network with two hidden layers, using ReLU as my activation function
    
    # Name the neural network
    net = NeuralNet()

    # Define the loss function as cross-entropy
    criterion = nn.NLLLoss()

    # Fix the optimisation method to be stochastic gradient descent
    optimiser = torch.optim.SGD(net.parameters(), lr=learning_rate)  

    total_step = len(train_loader)
    loss_values = []
    
    #Start timer of the training
    start_time = time.time()
    for epoch in range(num_epochs+1):

      #TRAIN MODEL
      net.train()
      train_loss = 0.0
  
      for i, (inputs, labels) in enumerate(train_loader, 0):
        labels = labels
            
        # forward pass
        outputs = net(inputs)
        m = nn.LogSoftmax(dim = 1)
        loss = criterion(m(outputs), labels)
        
        # backward and optimise
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()

        # update loss
        train_loss += loss.item()

      loss_values.append(train_loss / total_step)

    if details == 'yes': 
        print('Finished Training')
        elapsed_time = int(time.time() - start_time)
        print('Time it took to train model: {:02d}min {:02d}sec'.format((elapsed_time % 3600 // 60), elapsed_time % 60))
    
        plt.figure(figsize=(12,8))
        plt.xlim(-1, num_epochs+1)
        plt.ylabel('Training Loss', fontsize=20)
        plt.xlabel('Epoch', fontsize=20)
        plt.plot(loss_values)
    
        plt.show()
    
    # TEST MODEL
    net.eval()

    correct = 0
    total = 0
    y_predNN = []
    for inputs, labels in test_loader:
      outputs = net(inputs)
      _, predicted = torch.max(outputs.data, 1)
      y_predNN.append(predicted.numpy())
      total += labels.size(0)
      correct += (predicted == labels).sum().item()
    
    y_predNN = np.concatenate(y_predNN)
    accuracy = 100 * correct / total

    if details == 'yes': 
        print('Accuracy of the network on the test set: {0:.3f} %'.format(accuracy))
    
    return y_predNN, accuracy

Let us use this function to train and test the MLP classification I created above:

In [43]:
NeuralNet_function(MLP, X_train_nn, y_train_nn, X_test_nn, y_test_nn, 
                   input_size, hidden_size, num_classes,num_epochs, batch_size, learning_rate, details = 'yes')
Finished Training
Time it took to train model: 00min 33sec
Accuracy of the network on the test set: 87.537 %
Out[43]:
(array([4, 7, 9, ..., 3, 1, 3]), 87.53714285714285)

We get a nice accuracy score on the test set of 88% with a quick convergence of the training loss. The computational time is also acceptable.

2.2.2 Convolutional neural network (CNN) supervised classification

Next we do the same steps as above, but this time with a convolutional neural network.

In [44]:
class ConvNet(nn.Module):
    def __init__(self):
        super(ConvNet, self).__init__()
        self.conv1 = nn.Conv2d(1,6, kernel_size = 5, stride = 1, dilation = 1) #First convolution layer
        self.conv2 = nn.Conv2d(6,16, kernel_size = 5, stride = 1, dilation = 1) #Second convolution layer
        self.fc1 = nn.Linear(16*4*4, 120) #Hidden layers
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)
        
    def forward(self, x):
        out = x.view(-1,1,28,28)
        out = F.relu(self.conv1(out)) #Using ReLu as activation function
        out = F.max_pool2d(out, kernel_size = 2, stride = 2) #pooling layer
        out = F.relu(self.conv2(out))
        out = F.max_pool2d(out, kernel_size = 2, stride = 2) #pooling layer
        out = out.view(-1, 16*4*4)
        out = F.relu(self.fc1(out))
        out = F.relu(self.fc2(out))
        out = self.fc3(out)
        return out
In [45]:
NeuralNet_function(ConvNet, X_train_nn, y_train_nn, X_test_nn, y_test_nn, 
                   input_size, hidden_size, num_classes,num_epochs, batch_size, learning_rate, details = 'yes')
Finished Training
Time it took to train model: 05min 17sec
Accuracy of the network on the test set: 88.423 %
Out[45]:
(array([4, 7, 9, ..., 3, 1, 3]), 88.42285714285714)

This time the accuracy score is even better than for MLP classification with a similar training loss convergence, but the computational time is much more significant.

2.2.3 Comparisons of the classifiers

As we saw above, even though the CNN classifier gives a slightly better accuracy score on the test set than the MLP classifier, the computational time is much worse.

However there are more parameters considered in MLP classification because it is a fully connected model, as opposed to CNN classification which is a sparsely or partially connected model. Each node in MLP is connected to another in a very dense web — resulting in redundancy and inefficiency.

The higher accuracy in CNN classification is also due to the fact that it can do a deeper dive into the data with the added layers.

Now these accuracy scores are much higher than that of the kNN classification. Unsupervised classification is useful for quickly assigning labels to uncomplicated, broad classes so the computational cost is low indeed, but the accuracy of the classification is hence quite low. Supervised classification allows us to fine tune the information classes, resulting in much more accurate results, but a longer computational time.

Now, let us try to increase the CNN classifier's accuracy on the test score above 90%. To do so we will tweak a few aspects of the architecture of the CNN, and test each modification using a 5-fold cross validation alogrithm. We will use the modified CNN with the higher mean validation score in the cross validation as the best modification, and test its accuracy.

In [46]:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle = True, random_state = 999) 

def NN_crossval(ConvNet):
    accuracy_validation = np.zeros(5)
    j = 0

    for train_index, test_index in skf.split(X_train, y_train): 
        
        #Convert data to tensors for neural network
        X_train_nn = torch.from_numpy(X_train[train_index]).float()
        y_train_nn = torch.from_numpy(y_train[train_index])
        X_test_nn = torch.from_numpy(X_train[test_index]).float()
        y_test_nn = torch.from_numpy(y_train[test_index])
        
        y_pred, accuracy_validation[j] = NeuralNet_function(ConvNet,
                                                            X_train_nn, y_train_nn, X_test_nn, 
                                                            y_test_nn, input_size, hidden_size, 
                                                            num_classes,num_epochs, batch_size, learning_rate,
                                                           details = 'no')
        j = j+1
    
    mean_validation_score = np.mean(accuracy_validation)
    return mean_validation_score

First I try a CNN with kernel of size 4 in convolution layers.

In [47]:
class ConvNet1(nn.Module):
    def __init__(self):
        super(ConvNet1, self).__init__()
        self.conv1 = nn.Conv2d(1,6, kernel_size = 4, stride = 1, dilation = 1)
        self.conv2 = nn.Conv2d(6,16, kernel_size = 4, stride = 1, dilation = 1)
        self.fc1 = nn.Linear(16*4*4, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)
        
    def forward(self, x):
        out = x.view(-1,1,28,28)
        out = F.relu(self.conv1(out))
        out = F.max_pool2d(out, kernel_size = 2, stride = 2)
        out = F.relu(self.conv2(out))
        out = F.max_pool2d(out, kernel_size = 2, stride = 2)
        out = out.view(-1, 16*4*4)
        out = F.relu(self.fc1(out))
        out = F.relu(self.fc2(out))
        out = self.fc3(out)
        return out
In [48]:
NN_crossval(ConvNet1)
Out[48]:
86.36573195430688

We actually get a lower accuracy score so lowering the kernel size is not optimal.

Next, I try adding a hiddhen layer to make the network deeper.

In [49]:
class ConvNet2(nn.Module):
    def __init__(self):
        super(ConvNet2, self).__init__()
        self.conv1 = nn.Conv2d(1,6, kernel_size = 5, stride = 1, dilation = 1)
        self.conv2 = nn.Conv2d(6,16, kernel_size = 5, stride = 1, dilation = 1)
        self.fc1 = nn.Linear(16*4*4, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 62)
        self.fc4 = nn.Linear(62,10)
        
    def forward(self, x):
        out = x.view(-1,1,28,28)
        out = F.relu(self.conv1(out))
        out = F.max_pool2d(out, kernel_size = 2, stride = 2)
        out = F.relu(self.conv2(out))
        out = F.max_pool2d(out, kernel_size = 2, stride = 2)
        out = out.view(-1, 16*4*4)
        out = F.relu(self.fc1(out))
        out = F.relu(self.fc2(out))
        out = F.relu(self.fc3(out))
        out = self.fc4(out)
        return out
In [50]:
NN_crossval(ConvNet2)
Out[50]:
88.34099554900867

Not at 90% yet.

Adding a layer was the right intuition though, as we are essentially making the network deeper which allows for finer tuning.

In fact, the CNN above has been modified and test over a 100 times and I never achieved a score higher than 90%. I tried everything from drop out between layers, epoch, batch size, changing the activation function.

This is the model that achieved the highest score so far, so I'm afraid we will have to settle for this model.

In [54]:
NeuralNet_function(ConvNet2, X_train_nn, y_train_nn, X_test_nn, y_test_nn, 
                   input_size, hidden_size, num_classes,num_epochs, batch_size, learning_rate, details = 'yes')
Finished Training
Time it took to train model: 06min 40sec
Accuracy of the network on the test set: 88.103 %
Out[54]:
(array([4, 7, 9, ..., 3, 1, 3]), 88.10285714285715)
In [ ]: